Abstract. A large class of unsupervised algorithms for Word Sense Disambiguation
(WSD) is that of dictionary-based methods. Many of these algorithms are rooted in Lesk's
algorithm, which exploits the sense definitions in the dictionary directly. Our approach
uses the lexical base WordNet [3] for a new algorithm derived from Lesk's, namely the chain
algorithm for disambiguation of all words (CHAD). We show how translation from one
language into another and text entailment verification can be accomplished with
this disambiguation.

Key words and phrases. WSD, machine translation, text entailment.

1. The polysemy

Word sense disambiguation is the process of identifying the correct sense of words
in particular contexts. Solving WSD seems to be AI-complete (that is, its
solution requires a solution to all the general AI problems of representing and reasoning
about arbitrary knowledge), and it is one of the most important open problems in NLP
[5], [6], [7], [10], [12], [13]. In WordNet, the most well-developed and widely used
electronic on-line lexical database for English, the polysemy of the different categories
of words is ordered as follows: highest for verbs, then nouns, and lowest for adjectives
and adverbs. Usually, disambiguation is carried out for a single target word. One would
expect the words closest to the target word to be of greater semantic importance for it
than the other words in the text. The context is hence a source of information for
identifying the meaning of polysemous words. Contexts may be used in two ways: a) as a
bag of words, without considering relationships with the target word in terms of
distance, grammatical relations, etc.; b) with relational information. The bag of words
approach works better for nouns than for verbs but is less effective than methods that
take other relations into consideration. Studies of syntactic relations reached some
interesting conclusions: verbs derive more disambiguation information from their objects
than from their subjects, adjectives derive almost all disambiguation information from the
nouns they modify, and nouns are best disambiguated by directly adjacent adjectives or
nouns [5]. All these findings suggest that a global approach (disambiguation of all words)
helps to disambiguate each POS.

In this paper we propose a global disambiguation algorithm called the chain algorithm for
disambiguation, CHAD, which combines elements of both views of a context:
because the algorithm is order sensitive, it belongs to the class of algorithms that
depend on relational information; at the same time, it does not require syntactic analysis
or syntactic parsing.

In section 2 of this paper we review Lesk's algorithm for WSD. In section 3 we present
the "triplet" algorithm for three words and the CHAD algorithm. In section 4 we describe some
experiments and evaluations with CHAD. Section 5 presents applications of CHAD to
translation (here from Romanian to English) and to text entailment verification.
Section 6 draws some conclusions and outlines further work.



2. Dictionary-based methods 

Work in WSD reached a turning point in the 1980s when large-scale lexical resources,
such as machine readable dictionaries, became widely available. One of the best known
dictionary-based methods is that of Lesk (1986). It starts from the idea that a word's
dictionary definition is a good indicator of the senses of this word and uses the definition
in the dictionary directly.

Let us recall the basic algorithm of Lesk [8]:

Suppose that for a polysemic target word $w$ there are in a dictionary $N_s$ senses
$s_1, s_2, \ldots, s_{N_s}$, given in an equal number of definitions $D_1, D_2, \ldots, D_{N_s}$.
Here we mean by $D_i$ the set of words contained in the $i$-th definition.

Consider that the new context to be disambiguated is $C_{new}$. The reduced form of
Lesk's algorithm is:

for k = 1, N_s do
    $score(s_k) = | D_k \cap ( \bigcup_{w_j \in C_{new}} \{ w_j \} ) |$
endfor
Calculate $s' = \arg\max_k score(s_k)$

The score of a sense is the number of words shared by its definition (gloss) and the
context. The target word is assigned the sense whose gloss shares the largest number of
words with the context.
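
The reduced Lesk procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: whitespace tokenization and lower-casing are not specified in the text, and the function name `simplified_lesk` and the two toy glosses are hypothetical (they are not WordNet entries).

```python
def simplified_lesk(glosses, context):
    """Return (sense index, score) for the gloss sharing most words with the context.

    glosses: list of gloss strings, one per sense (index 0 is the first sense).
    context: string with the words surrounding the target word.
    """
    context_words = set(context.lower().split())
    best_index, best_score = 0, -1
    for k, gloss in enumerate(glosses):
        # score(s_k) = |D_k intersected with the set of context words|
        score = len(set(gloss.lower().split()) & context_words)
        if score > best_score:  # strict '>' keeps the earliest sense on ties
            best_index, best_score = k, score
    return best_index, best_score


# Toy glosses invented for this illustration.
BANK_GLOSSES = [
    "sloping land beside a body of water",
    "financial institution that accepts deposits and lends money",
]

if __name__ == "__main__":
    print(simplified_lesk(BANK_GLOSSES, "she deposits money at the financial institution"))
```

Resolving ties in favor of the earliest sense is a design choice of this sketch; it mirrors the common practice of falling back on the first dictionary sense.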

The algorithm of Lesk was successfully extended in [2] using the WordNet dictionary
for English. WordNet was created by hand in the 1990s and includes definitions (glosses)
for individual senses of words, as in a dictionary. Additionally, it defines groups of
synonymous words representing the same lexical concept (synsets) and organizes them into
a conceptual hierarchy. The paper [2] uses this conceptual hierarchy to improve the
original Lesk method by augmenting the definitions with non-gloss information: synonyms,
examples, and glosses of related words (hypernyms, hyponyms). The authors also introduced
a novel overlap measure between glosses that favors multi-word matches.
3. Chain algorithm for word sense disambiguation - CHAD

First we present an algorithm for the disambiguation of a triplet. In a sense, our
triplet algorithm is similar to the global disambiguation algorithm for a window of two
words around a target word given in [2]. In contrast, CHAD disambiguates all
words in a text of any length, dispensing with the notions of "window" and "target word"
used in similar studies, without increasing the computational complexity.

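The algorithm for disambiguation of a triplet of words $w_1 w_2 w_3$ for the Dice measure is
the following:

begin
    for each sense $s_{w_1}^{i}$ do
        for each sense $s_{w_2}^{j}$ do
            for each sense $s_{w_3}^{k}$ do
                $score(i, j, k) = 3 \times \frac{|D_{w_1}^{i} \cap D_{w_2}^{j} \cap D_{w_3}^{k}|}{|D_{w_1}^{i}| + |D_{w_2}^{j}| + |D_{w_3}^{k}|}$
            endfor
        endfor
    endfor
    $(i^*, j^*, k^*) = \arg\max_{(i,j,k)} score(i, j, k)$
    /* sense of $w_1$ is $s_{w_1}^{i^*}$, sense of $w_2$ is $s_{w_2}^{j^*}$, sense of $w_3$ is $s_{w_3}^{k^*}$ */
end

For the overlap measure the score is calculated as:

$score(i, j, k) = \frac{|D_{w_1}^{i} \cap D_{w_2}^{j} \cap D_{w_3}^{k}|}{\min(|D_{w_1}^{i}|, |D_{w_2}^{j}|, |D_{w_3}^{k}|)}$

For the Jaccard measure the score is calculated as:

$score(i, j, k) = \frac{|D_{w_1}^{i} \cap D_{w_2}^{j} \cap D_{w_3}^{k}|}{|D_{w_1}^{i} \cup D_{w_2}^{j} \cup D_{w_3}^{k}|}$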
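
As a hedged sketch of the triplet algorithm and its three measures, the following Python code represents each gloss as a set of words and enumerates all sense combinations. The function names, the zero-denominator guards, and the toy gloss sets are assumptions made for illustration; sense indices are 0-based here, unlike WordNet sense numbers.

```python
from itertools import product


def dice(a, b, c):
    """3 * |a & b & c| / (|a| + |b| + |c|)."""
    total = len(a) + len(b) + len(c)
    return 3 * len(a & b & c) / total if total else 0.0


def overlap(a, b, c):
    """|a & b & c| / min(|a|, |b|, |c|)."""
    smallest = min(len(a), len(b), len(c))
    return len(a & b & c) / smallest if smallest else 0.0


def jaccard(a, b, c):
    """|a & b & c| / |a | b | c|."""
    union = len(a | b | c)
    return len(a & b & c) / union if union else 0.0


def disambiguate_triplet(g1, g2, g3, measure=dice):
    """g1, g2, g3: lists of gloss word sets, one set per sense of each word.

    Returns ((i, j, k), score) for the best-scoring sense combination.
    """
    best, best_score = (0, 0, 0), -1.0
    for i, j, k in product(range(len(g1)), range(len(g2)), range(len(g3))):
        s = measure(g1[i], g2[j], g3[k])
        if s > best_score:  # ties keep the earliest combination
            best, best_score = (i, j, k), s
    return best, best_score


# Toy gloss sets invented for this illustration.
G1 = [{"a", "b"}, {"x", "y", "z"}]
G2 = [{"x", "q"}, {"b", "c"}]
G3 = [{"x", "y"}, {"m"}]

if __name__ == "__main__":
    print(disambiguate_triplet(G1, G2, G3))
    print(disambiguate_triplet(G1, G2, G3, measure=overlap))
    print(disambiguate_triplet(G1, G2, G3, measure=jaccard))
```

The factor 3 in the Dice measure normalizes the score so that three identical glosses score 1.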

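In short, CHAD begins with the disambiguation of the triplet $w_1 w_2 w_3$ and then adds
to the right the next word to be disambiguated. Hence at each step it disambiguates
a new triplet in which the first two words are already associated with their best senses,
and the disambiguation of the third word depends on these first two words. The CHAD
algorithm for the disambiguation of the sentence $w_1 w_2 \ldots w_N$ is:

begin
    Disambiguate triplet $w_1 w_2 w_3$
    i = 4
    while i \le N do
        for each sense $s_{w_i}^{j}$ of $w_i$ do
            Calculate $score(s_{w_i}^{j}) = 3 \times \frac{|D_{w_{i-2}}^{*} \cap D_{w_{i-1}}^{*} \cap D_{w_i}^{j}|}{|D_{w_{i-2}}^{*}| + |D_{w_{i-1}}^{*}| + |D_{w_i}^{j}|}$
        endfor
        Calculate $s_{w_i}^{*} := \arg\max_{j} score(s_{w_i}^{j})$
        i := i + 1
    endwhile
end

Here $D_{w}^{*}$ denotes the gloss of the sense already selected for the word $w$.

Due to the brevity of definitions in WN, many values of
$|D_{w_{i-2}}^{*} \cap D_{w_{i-1}}^{*} \cap D_{w_i}^{j}|$ are 0. In these cases we assigned the
first sense in WN to $s_{w_i}^{*}$.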
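
The chaining step, including the first-sense fallback, can be sketched in Python as below. This is a minimal illustration that assumes glosses are already available as word sets and that the first triplet has been disambiguated beforehand; the name `chad`, the 0-based indexing, and the toy data are hypothetical.

```python
def dice(a, b, c):
    """3 * |a & b & c| / (|a| + |b| + |c|)."""
    total = len(a) + len(b) + len(c)
    return 3 * len(a & b & c) / total if total else 0.0


def chad(glosses, first_triplet):
    """Chain disambiguation over a sentence.

    glosses[i] is the list of gloss word sets of word i (index 0 = first sense).
    first_triplet holds the senses already chosen for the first three words.
    Returns the chosen 0-based sense index for every word.
    """
    senses = list(first_triplet)
    for i in range(3, len(glosses)):
        prev2 = glosses[i - 2][senses[i - 2]]  # gloss of the sense fixed for w_{i-2}
        prev1 = glosses[i - 1][senses[i - 1]]  # gloss of the sense fixed for w_{i-1}
        scores = [dice(prev2, prev1, g) for g in glosses[i]]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] == 0.0:
            best = 0  # no overlap at all: fall back to the first sense
        senses.append(best)
    return senses


# Toy gloss sets invented for this illustration.
GLOSSES = [
    [{"a", "b"}],
    [{"b", "c"}, {"x", "y"}],
    [{"b", "d"}],
    [{"q"}, {"b", "e"}],
    [{"z"}, {"w"}],
]

if __name__ == "__main__":
    print(chad(GLOSSES, (0, 0, 0)))
```

Each new word costs one pass over its own senses, since the two preceding senses are already fixed; only the first triplet requires the full enumeration of sense combinations.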
4. Some experiments with chain algorithm. Experimental evaluation of CHAD

In this section we briefly describe some experiments that we carried out in order to
validate the proposed chain algorithm CHAD.

4.1. Implementation details. We have developed an application that implements CHAD
and can be used to:

• disambiguate words (4.2);

• translate Romanian words into English (5.1);

• verify text entailment (5.2).

The application is written in Java (JDK 1.5.0) and uses the HttpUnit 1.6.2 API [15]. Written
in Java, HttpUnit is free software that emulates the relevant portions of browser behavior,
including form submission, JavaScript, basic HTTP authentication, cookies, and automatic
page redirection, and allows Java test code to examine returned pages either as text, an
XML DOM, or containers of forms, tables, and links [15].

We have used HttpUnit in order to search WordNet through the dictionary from [16].
More specifically, the following Java classes from [15] are used:

• WebConversation. It represents the context for a series of HTTP requests. 
This class manages cookies used to maintain session context, computes rela- 
tive URLs, and generally emulates the browser behavior needed to build an 
automated test of a web site. 

• WebResponse. This class represents a response to a web request from a web 
server. 

• WebForm. This class represents a form in an HTML page. Using this class we 
can examine the parameters defined for the form, the structure of the form (as 
a DOM), and the text of the form. We have used WebForm class in order to 
simulate the submission of the form with corresponding parameters. 

4.2. Results. We tested CHAD on 10 files of the Brown corpus, which are POS tagged.
Recall that WN stores only stems of words. We therefore first preprocessed the glosses and
the input files, replacing inflected words with their stems.

The reason for choosing the Brown corpus was the possibility offered by the SemCor corpus
(the best known publicly available corpus hand tagged with WN senses) to evaluate the
results. A word counts as correctly disambiguated when it receives the same sense as in
SemCor. We ran CHAD separately for: 1. nouns, 2. verbs, and 3. nouns, verbs, adjectives and
adverbs. In the case of CHAD applied to nouns, the output is the sequence of nouns
tagged with senses. The tag noun#n#i means that for the noun noun the WN sense i was
found. The cases of disambiguation of verbs and of all POS are analogous. The results
are presented in tables 1 and 2. As the CHAD algorithm depends on the length of
glosses, and as nouns have the longest glosses, the highest precision is obtained for nouns.
Figure 3 traces the progress of the precision. After dropping and rising, the precision
finally stabilizes at the value 0.767 (for the file Br-a01). The most interesting aspect of
this graph is that it shows how the chain algorithm works and how the correct or incorrect
disambiguation of the first two words of the first triplet influences the disambiguation of
the following words.
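
To make the evaluation concrete, the following Python sketch parses tags of the form noun#n#i and computes precision against gold tags. It assumes that predicted and gold tags are aligned one to one and that precision is the fraction of tagged words whose sense agrees with the gold annotation; the exact counting used in the experiments is not spelled out in the text. The function names and the sample tags are hypothetical.

```python
def parse_tag(tag):
    """Split 'lemma#pos#sense' into (lemma, pos, sense number)."""
    lemma, pos, sense = tag.rsplit("#", 2)
    return lemma, pos, int(sense)


def precision(predicted, gold):
    """Fraction of aligned tags whose lemma, POS and sense all agree."""
    pairs = list(zip(predicted, gold))
    if not pairs:
        return 0.0
    correct = sum(parse_tag(p) == parse_tag(g) for p, g in pairs)
    return correct / len(pairs)


# Sample tags invented for this illustration (not taken from SemCor).
PREDICTED = ["bank#n#1", "money#n#2", "rate#n#1", "loan#n#1"]
GOLD = ["bank#n#1", "money#n#1", "rate#n#1", "loan#n#1"]

if __name__ == "__main__":
    print(precision(PREDICTED, GOLD))
```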

It is known that, at the Senseval 2 contest, only 2 out of the 7 teams (with unsupervised
methods) achieved higher precision than the WordNet 1st sense baseline. In
figures 1, 2 and 3 we compare the precision of CHAD for the Dice, Overlap
and Jaccard measures with the WordNet 1st sense baseline on 10 files of the Brown corpus.

Comparing the precision obtained with the Overlap measure and the precision given
by the WordNet 1st sense for 10 files of the Brown corpus (Br-a01, Br-a02, Br-a11, Br-a12,
Br-a13, Br-a14, Br-a15, Br-b13, Br-b20 and Br-c01), we obtained the following differences,
all in favor of the WordNet 1st sense baseline:

• for nouns, the minimum difference was 0.0077, the maximum difference was
0.0706, and the average difference was 0.0338; the difference was greater than or equal
to 0.04 for 4 files and lower for 6 files;

• for all parts of speech, the minimum difference was 0.0313, the maximum
difference was 0.0681, and the average difference was 0.0491; the difference was greater
than or equal to 0.04 for 7 files and lower for 3 files;

• for verbs, the minimum difference was 0.0078, the maximum difference
was 0.0591, and the average difference was 0.0340; the difference was greater than or
equal to 0.04 for 4 files and lower for 6 files.
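
The summary for nouns can be checked approximately from the Overlap and WN1 columns of Table 1. The Python sketch below recomputes the differences from those three-decimal table values; because of rounding, it yields 0.007, 0.071 and 0.0336 rather than the four-decimal figures 0.0077, 0.0706 and 0.0338 quoted in the text, while the count of files at or above 0.04 agrees. The function name `gap_summary` is hypothetical.

```python
from statistics import mean

# (file, Overlap precision, WordNet 1st sense precision) for nouns, from Table 1.
NOUNS = [
    ("Bra01", 0.767, 0.800), ("Bra02", 0.758, 0.808),
    ("Bra14", 0.754, 0.769), ("Bra11", 0.746, 0.773),
    ("Brb20", 0.743, 0.751), ("Bra13", 0.739, 0.746),
    ("Brb13", 0.717, 0.732), ("Bra12", 0.710, 0.781),
    ("Bra15", 0.682, 0.725), ("Brc01", 0.661, 0.728),
]


def gap_summary(rows, threshold=0.04):
    """Summarize the differences WN1 - Overlap over all files."""
    gaps = [wn1 - ov for _, ov, wn1 in rows]
    return {
        "min": min(gaps),
        "max": max(gaps),
        "mean": mean(gaps),
        "at_or_above": sum(g >= threshold for g in gaps),
    }


if __name__ == "__main__":
    for key, value in gap_summary(NOUNS).items():
        print(key, round(value, 4))
```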

Let us remark that in CHAD the standard concept of a window size parameter
[2] does not apply: the window is simply the variable span between the previous and the
following word with respect to the current word.
Figure 1. Noun Precision

File     Words   Dice    Jaccard   Overlap   WN1
Bra01    486     0.758   0.758     0.767     0.800
Bra02    479     0.735   0.731     0.758     0.808
Bra14    401     0.736   0.736     0.754     0.769
Bra11    413     0.724   0.726     0.746     0.773
Brb20    394     0.740   0.740     0.743     0.751
Bra13    399     0.734   0.734     0.739     0.746
Brb13    467     0.708   0.708     0.717     0.732
Bra12    433     0.696   0.696     0.710     0.781
Bra15    354     0.677   0.674     0.682     0.725
Brc01    434     0.653   0.653     0.661     0.728

Table 1. Precision for Nouns, sorted in descending order of the precision of the
Overlap measure




Figure 2. All Parts of Speech Precision 



5. Applications of CHAD algorithm 

5.1. Application to Romanian-English translation. WSD is only an intermediate task
in NLP. In Machine Translation, WSD is required for lexical choice for words that have
different translations for different senses and that are potentially ambiguous within a
given document. However, most Machine Translation models do not use explicit WSD ([1], in
the Introduction).

The algorithm we implemented consists of the word-by-word translation of a Romanian
text (using the dictionary at http://lit.csci.unt.edu/~rada/downloads/RoNLP/R.E tralexand),
followed by the application of the chain algorithm to the English text. As a Romanian
word has multiple translations in English, the disambiguation of a triplet is modified as
follows. Let the word $w_1$ have $k_1$ translations $t_{w_1}^{m}$, the word $w_2$ have $k_2$
translations $t_{w_2}^{n}$, and the word $w_3$ have $k_3$ translations $t_{w_3}^{p}$. Each triplet
$t_{w_1}^{m} t_{w_2}^{n} t_{w_3}^{p}$ is disambiguated with the triplet disambiguation algorithm,
and then the triplet with the maximum score is selected:
File     Words   Dice    Jaccard   Overlap   WN1
Bra14    931     0.699   0.701     0.711     0.742
Bra02    959     0.637   0.685     0.697     0.753
Brb20    930     0.672   0.674     0.693     0.731
Bra15    1071    0.653   0.651     0.684     0.732
Bra13    924     0.667   0.673     0.682     0.735
Bra01    1033    0.650   0.648     0.674     0.714
Brb13    947     0.649   0.650     0.674     0.722
Bra12    1163    0.626   0.622     0.649     0.717
Bra11    1043    0.634   0.639     0.648     0.708
Brc01    1100    0.625   0.627     0.638     0.688

Table 2. Precision for all POS, sorted in descending order of the precision of the
Overlap measure
Figure 3. Precision in progress



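begin
    for m = 1, k_1 do
        for n = 1, k_2 do
            for p = 1, k_3 do
                Disambiguate triplet $t_{w_1}^{m} t_{w_2}^{n} t_{w_3}^{p}$ in $(t_{w_1}^{m})^* (t_{w_2}^{n})^* (t_{w_3}^{p})^*$
                Calculate $score((t_{w_1}^{m})^* (t_{w_2}^{n})^* (t_{w_3}^{p})^*)$
            endfor
        endfor
    endfor
    Calculate $(m^*, n^*, p^*) = \arg\max_{(m,n,p)} score((t_{w_1}^{m})^* (t_{w_2}^{n})^* (t_{w_3}^{p})^*)$
    Optimal translation of triplet is $(t_{w_1}^{m^*})^* (t_{w_2}^{n^*})^* (t_{w_3}^{p^*})^*$
end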
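
The selection over candidate translations can be sketched in Python as follows. The sketch assumes that every candidate English word comes with its glosses as word sets and uses the Dice measure for the inner triplet disambiguation; the function names, the 0-based sense indices, and the toy candidates and glosses are hypothetical and are not taken from the Romanian-English dictionary or from WordNet.

```python
from itertools import product


def dice(a, b, c):
    """3 * |a & b & c| / (|a| + |b| + |c|)."""
    total = len(a) + len(b) + len(c)
    return 3 * len(a & b & c) / total if total else 0.0


def disambiguate_triplet(g1, g2, g3):
    """Best sense combination (i, j, k) and its Dice score for three words."""
    best, best_score = (0, 0, 0), -1.0
    for i, j, k in product(range(len(g1)), range(len(g2)), range(len(g3))):
        s = dice(g1[i], g2[j], g3[k])
        if s > best_score:
            best, best_score = (i, j, k), s
    return best, best_score


def best_translation(c1, c2, c3):
    """c1, c2, c3 map each candidate translation of a source word to its glosses.

    Returns ((t1, t2, t3), (i, j, k), score) for the best-scoring candidate triplet.
    """
    best = None
    for t1, t2, t3 in product(c1, c2, c3):
        senses, score = disambiguate_triplet(c1[t1], c2[t2], c3[t3])
        if best is None or score > best[2]:
            best = ((t1, t2, t3), senses, score)
    return best


# Toy candidates and glosses invented for this illustration.
C1 = {"time": [{"period", "duration"}], "weather": [{"atmosphere", "rain"}]}
C2 = {"pass": [{"go", "by"}, {"period", "elapse", "duration"}]}
C3 = {"hour": [{"period", "duration", "sixty"}]}

if __name__ == "__main__":
    print(best_translation(C1, C2, C3))
```

The outer loop multiplies the cost of one triplet disambiguation by the number of candidate combinations, so in this sketch the number of translations per word directly bounds the running time.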

Let us remark that $(t_{w_1}^{m^*})^*$, for example, is a synset which corresponds to the best
translation for $w_1$ produced by the CHAD algorithm. However, since Romanian uses
many words joined by various orthographic signs, such compound words are not found in
the Romanian-English dictionary. Accordingly, not every Romanian word yields an
English equivalent as output of the above algorithm. However, many translations are
still correct. For example, the translation of the expression vreme trece (from the poem
"Glossa" by our national poet Mihai Eminescu) is Word: (Rom) vreme (Eng) Age#n#4, Word:
(Rom) trece (Eng) Flow#v#1. As another example from the same poem, in which the
synset of each word is shown (as output by our application), ţine toate minte is translated
as Word: (Rom) tine (Eng) Keep#v#5: {keep, maintain}. Word: (Rom) toate (Eng)
All#adv#2: {wholly, entirely, completely, totally, all, altogether, whole}. Word: (Rom)
minte (Eng) Judgment#n#2: {judgment, judgement, assessment}.

5.2. Application to text entailment verification. The recognition of text entailment is
one of the most complex tasks in Natural Language Understanding [14]. Thus, a very
important problem in some computational linguistics applications (such as question
answering, summarization, segmentation of discourse, and others) is to establish whether a
text follows from another text. For example, a QA system has to identify texts that entail
the expected answer. Similarly, in IR the concept denoted by a query expression should be
entailed by relevant retrieved documents. In summarization, a redundant sentence should be
entailed by other sentences in the summary. The application of WSD to text entailment
verification is treated by the authors in the paper "Text entailment verification with text
similarity" in this volume.

6. Conclusions and further work 

In this paper we presented a new algorithm for word sense disambiguation. The
algorithm is parametrized for: 1. all words (that is, nouns, verbs, adjectives, adverbs); 2.
all nouns; 3. all verbs. Some experiments with this algorithm on ten files of the Brown
corpus are presented in section 4.2. The stemming was carried out using the list from
http://snowball.tartarus.org/algorithms/porter/diffs.txt. The precision is calculated
relative to the corresponding annotated files in the SemCor corpus. Some implementation
details are given in section 4.1.

We showed in section 5 how the disambiguation of a text helps in the automated translation
of a text from one language into another: each word in the first text is translated
into the most appropriate word in the second text. This appropriateness is considered
from two points of view: 1. that of the possible translations and 2. that of the real
(disambiguated) sense in the second text. Some experiments with
Romanian-English translation and text entailment verification are given (section 5).

Another problem which we intend to address in further work is the optimization
of a query in Information Retrieval. Finding whether a particular sense is connected with
an instance of a word is similar to the IR task of finding whether a document is relevant
to a query. It is established that a good WSD program can improve retrieval performance.
As IR is used by millions of users, an improvement of even a few percentage points on
average could be seen as very significant.
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Abstract. A large class of unsupervised algorithms for Word Sense Disambiguation 
(WSD) is that of dictionary-based methods. Various algorithms have as the root Lesk's 
algorithm, which exploits the sense definitions in the dictionary directly. Our approach 
uses the lexical base WordNet ( 3 1 for a new algorithm originated in Lesk's, namely chain 
algorithm for disambiguation of all words (CHAD). We show how translation from a 
language into another one and also text entailment verification could be accomplished by 
this disambiguation. 



1. The polysemy 

Word sense disambiguation is the process of identifying the correct sense of words 
in particular contexts. The solving of WSD seems to be AI complete ( that means its 
solution requires a solution to all the general AI problems of representing and reasoning 
about arbitrary) and it is one of the most important open problems in NLP ||5J,|i6J,||3, 
lfT0l . lfT2ll . |[T3l . In the electronical on-line dictionary WordNet, the most well-developed 
and widely used lexical database for English, the polysemy of different category of words 
is presented in order as: the highest for verbs, then for nouns, and the lowest for adjec- 
tives and adverbs. Usually, the process of disambiguation is realized for a single, target 
word. One would expect the words closest to the target word to be of greater semantical 
importance for it than the other words in the text. The context is hence a source of infor- 
mation to identify the meaning of the polysemous words. The contexts may be used in 
two ways: a) as bag of words, without consideration of relationships with the target word 
in terms of distance, grammatical relations, etc.; b) with relational information. The bag 
of words approach works better for nouns than verbs but is less effective than methods that 
take other relations in consideration. Studies about syntactic relations determined some 
interesting conclusions: verbs derive more disambiguation information from their objects 
than from their subjects, adjectives derive almost all disambiguation information from the 
nouns they modify, and nouns are best disambiguated by directly adjacent adjectives or 
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nouns ||5]. All these advocate that a global approach (disambiguation of all words) helps 
to disambiguate each POS. 

In this paper we propose a global disambiguation algorithm called chain algorithm for 
disambiguation, CHAD, which presents elements of both points of view about a context: 
because this algorithm is order sensitive it belongs to the class of algorithms which 
depend of relational information; in the same time it doesn't require syntactic analysis 
and syntactic parsing. 

In section 2 of this paper we review Lesk's algorithm for WSD. In section 3 we present 
"triplet" algorithm for three words and CHAD algorithm. In section 4 we describe some 
experiments and evaluations with CHAD. Section 5 introduces some conclusions of us- 
ing the CHAD for translation (here from Romanian language to English) and for text 
entailment verification. Section 6 draws some conclusions and further work. 



2. Dictionary-based methods 

Work in WSD reached a turning point in the 1980s when large-scale lexical resources, 
such as machine readable dictionaries, became widely available. One of the best known 
dictionary-based method is that of Lesk (1986). It starts from the idea that a word's 
dictionary definition is a good indicator for the senses of this word and uses the definition 
in the dictionary directly. 

Let us remember basic algorithm of Lesk ||8]: 

Suppose that for a polysemic target word w there are in a dictionary Ns senses 
si, S2, • • • , given in an equal number of definitions Di, D2, ■ ■ ■ , Dns- Here we 
mean by Di the set of words contained in the i-th definition. 

Consider that the new context to be disambiguated is Cnew The reduced form of 
Lesk's algorithm is: 

for fc = 1, Ns do 

score{sk) =| Dfc n (U^.gc^e^.lwj}) 
endfor 

Calculate s' = argmaxkScore{sk) 

The score of a sense is the number of words that are shared by the different sense 
definitions (glosses) and the context. A target word is assigned that sense whose gloss 
shares the largest number of words. 

The algorithm of Lesk was successfully developed in |2 1 by using WordNet dictionary 
for English. It was created by hand in 1990s and includes definitions (glosses) for indi- 
vidual senses of words, as in a dictionary. Additionally it defines groups of synonymous 
words representing the same lexical concept (synset) and organizes them into a conceptual 
hierarchy. The paper |2l uses this conceptual hierarchy for improving the original Lesk's 
method by augmenting the definitions with non-gloss information: synonyms, examples 
and glosses of related words (hypernyms, hyponyms). Also, the authors introduced a 
novel overlap measure between glosses which favorites multi-word matching. 
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3. Chain algorithm for word sense disambiguation - CHAD. 

First of all we present an algorithm for disambiguation of a triplet. In a sense, our 
triplet algorithm is similar with global disambiguation algorithm for a window of two 
words around a target word given f2\. Instead, our CHAD realizes disambiguation of all- 
words in a text with any length, ignoring the notion of "window" and "target word" and 
target word in similar studies, all that without increasing the computational complexity. 

The algorithm for disambiguation of a triplet of words 'W1W2W3 for Dice measure is 
the following: 
begin 

for each sense s^,^ do 
for each sense si,^ do 
for each sense s'^^ do 
score(i,j, fc) = 3 x 
endfor 
endfor 
endfor 

{i* ,j* , k*) — argmaX(^ij^ic-)Score{i, j, k) I* sense of wi is s\^^, sense of W2 

is si,2 ■, sense of w-i, is */ 
end 

For the overlap measure the score is calculated as: score{i, j, k) = „^in^i^'Q^'~'^^i^'^^Jl)'' 



l-Drai nD„2nl3„3 1 
D»ll + |D„,| + |C„3| 



For the Jaccard measure the score is calculates as: score{i,j, k) — 1^ 



|D„iUD„2UD„3| 

Shortly, CHAD begins with the disambiguation of a triplet W1W2W3 and then adds 
to the right the following word to be disambiguated. Hence it disambiguates at a time 
a new triplet, where first two words are already associated with the best senses and the 
disambiguation of the third word depends on these first two words. CHAD algorithm for 
disambiguation of the sentence wiW2---Wm is: 

begin 

Disambiguate triplet w\W2W3, 
i = 4 

while i < TV do 



Calculate score(si) = 3 x 



\Ot,._^\ + \Di,._^\ + \D'J^\ 

Calculate s* := argmaXs-score{si) 
i:=i + l 
endwhile 
end 

Due to the brevity of definitions in WN many values of | i?^ _ H _ n 
0. We attributed the first sense in WN for s* in this cases. 
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4. Some experiments with chain algorithm. Experimental evaluation of 

CHAD 

In this section we shortly describe some experiments that we have made in order to 
vaUdate the proposed chain algorithm CHAD. 

4. 1 . Implementation details. We have developed an apphcation that implements CHAD 
and can be used to: 

• disambiguate words i4.2h 

• translate words into Romanian language (15. Il l: 

• text entailment verification (5.2). 

The application is written in JDK 1.5.0. and uses HttpUnit 1.6.2 API ifTsl . Written in 
Java, HttpUnit is a free software that emulates the relevant portions of browser behavior, 
including form submission, JavaScript, basic http authentication, cookies and automatic 
page redirection, and allows Java test code to examine returned pages either as text, an 
XML DOM, or containers of forms, tables, and links IfTSl . 

We have used HttpUnit in order to search WordNet through the dictionary from |fT6l . 
More specifically, the following Java classes from 1 15 1 are used: 

• WebConversation. It represents the context for a series of HTTP requests. 
This class manages cookies used to maintain session context, computes rela- 
tive URLs, and generally emulates the browser behavior needed to build an 
automated test of a web site. 

• WebResponse. This class represents a response to a web request from a web 
server. 

• WebForm. This class represents a form in an HTML page. Using this class we 
can examine the parameters defined for the form, the structure of the form (as 
a DOM), and the text of the form. We have used WebForm class in order to 
simulate the submission of the form with corresponding parameters. 

4.2. Results. We tested our CHAD on 10 files of Brown corpus, which are POS tagged. 
Recall that WN stores only stems of words. So, we first preprocessed the glosses and the 
input files, replacing inflected words with their stems. 

The reason for choosing Brown corpus was the possibility offered by SemCor corpus 
(the best known publicly available corpus hand tagged with WN senses) to evaluate the 
results. The correct disambiguated words means the disambiguated words as in SemCor. 
We ran separately CHAD for: 1. nouns, 2. verbs, and 3. nouns, verbs, adjectives and 
adverbs. In the case of CHAD addressed to nouns, the output is the sequence of nouns 
tagged with senses. The tag ?T.oun#n#i means that for noun noun the WN sense i was 
found. Analogously for the case of disambiguation on verbs and of all POS. The results 
are presented in tables 1 and 2. As our CHAD algorithm is dependent on the length of 
glosses, and as nouns have the longest glosses, the highest precision is obtained for nouns. 
In Figure 3, the Precision Progress can be traced. By dropping and rising, the precision 
finally stabilizes to value 0.767 (for the file Br-aOl). The most interesting part of this 
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graph is that he shows how this Chain Algorithm works and how the correct or incorrect 
disambiguation of first two words from the first triplet influences the disambiguation of 
the next words. 

It is known that, at Senseval 2 contest, only 2 out of the 7 teams (with the unsupervised 
methods) achieved higher precision than the WordNet I''* sense baseline. We compared in 
figures 1 , 2 and 3 the precision of CHAD for 10 files in Brown corpus, for Dice, Overlap 
and Jaccard measures with WordNet 1"* sense. 

Comparing the precision obtained with the Overlap Measure and the precision given 
by the WordNet 1"* sense for 10 files of Brown corpus (Br-aOl, Br-a02, Br-11, Br-12, 
Br-13, Br-14, Br-al5, Br-bl3, Br-b20 and Br-cOl), we obtained the following results: 

• for Nouns, the minimum difference was 0.0077, the maximum difference was 
0.0706, the average difference was 0.0338; 

• as a whole, for 4 files difference was greater or equal to 0.04, and for 6 files 
was lower; 

• in case of all Parts of Speech, the minimum difference was 0.0313, the maxi- 
mum difference was 0.0681, the average difference was 0.0491; 

• as a whole, for 7 files difference was greater or equal to 0.04, and for 3 files 
was lower; 

• relatively to Verbs, the minimum difference was 0.0078, the maximum differ- 
ence was 0.0591, the average difference was 0.0340; 

• as a whole, for 4 files difference was greater or equal to 0.04, and for 6 files 
was lower 

Let us remark that in our CHAD the standard concept of windows better size parameter 
||2l is not working: simply, a window is the variable space between the previous and the 
following word in respect to the current word. 



Precision for Nouns 




— Dice 

— Jaccard 
Overlapp 

— WordNet 1 St sense 



T-cj-^^ocrjcocMiD'- 

00t-^C\It-t-t-^0 
COmcDCQCQCDCDCDmCQ 



Files 



Figure 1 . Noun Precision 
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File 


Words 


Dice 


Jaccard 


Overlap 


WNl 


BraOl 


486 


0.758 


0.758 


0.767 


0.800 


Bra02 


479 


0.735 


0.731 


0.758 


0.808 


Bral4 


401 


0.736 


0.736 


0.754 


0.769 


Brail 


413 


0.724 


0.726 


0.746 


0.773 


Brb20 


394 


0.740 


0.740 


0.743 


0.751 


Bra 13 


399 


0.734 


0.734 


0.739 


0.746 


Brbl3 


467 


0.708 


0.708 


0.717 


0.732 


Bra 12 


433 


0.696 


0.696 


0.710 


0.781 


Bra 15 


354 


0.677 


0.674 


0.682 


0.725 


BrcOl 


434 


0.653 


0.653 


0.661 


0.728 



Table 1 . Precision for Nouns, sorted descending by the precision of 
Overlap measure 




Figure 2. All Parts of Speech Precision 



5. Applications of CHAD algorithm 

5.1. Application to Romanian-English translation. WSD is only an intermediate task 
in NLP. In Machine Translation WSD is required for lexical choise for words that have 
different translation for different senses and that are potentially ambiguous within a given 
document. However, most Machine Translation models do not use explicit WSD [1] (in 
Introduction). 

The algorithm implemented by us consists in the translation word by word of a Roma- 
nian text (using dictionary at http://lit.csci. unt.edu '^rada/downloads/RoNLP/R.E tralexand), 
then the application of chain algorithm to the English text. As the translation of a Roma- 
nian word in English is multiple, the disambiguation of a triplet is modified as following. 
Let be the word wi with ki translations i™^ , the word W2 with k2 translations t'!^^ and the 
word W3 with translations t^, ^ . Each triplet f!^^ tf^^ is disambiguated with the triplet 
disambiguation algorithm and then the triplet with the maxim score is selected: 
File     Words    Dice     Jaccard    Overlap    WN1
Bra14    931      0.699    0.701      0.711      0.742
Bra02    959      0.637    0.685      0.697      0.753
Brb20    930      0.672    0.674      0.693      0.731
Bra15    1071     0.653    0.651      0.684      0.732
Bra13    924      0.667    0.673      0.682      0.735
Bra01    1033     0.650    0.648      0.674      0.714
Brb13    947      0.649    0.650      0.674      0.722
Bra12    1163     0.626    0.622      0.649      0.717
Bra11    1043     0.634    0.639      0.648      0.708
Brc01    1100     0.625    0.627      0.638      0.688



Table 2. Precision for all parts of speech, sorted in descending order by the precision of
the Overlap measure




Figure 3. Precision in progress (horizontal axis: number of words)



begin

    for m = 1, k_1 do
        for n = 1, k_2 do
            for p = 1, k_3 do
                Disambiguate triplet t_1^m t_2^n t_3^p in (t_1^m)* (t_2^n)* (t_3^p)*
                Calculate score((t_1^m)* (t_2^n)* (t_3^p)*)
            endfor
        endfor
    endfor

    Calculate (m*, n*, p*) = argmax_{(m,n,p)} score((t_1^m)* (t_2^n)* (t_3^p)*)
    Optimal translation of triplet is (t_1^{m*})* (t_2^{n*})* (t_3^{p*})*
end
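
The selection loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name best_triplet_translation, the disambiguate callback (standing in for the triplet disambiguation step of CHAD, assumed to return the chosen senses together with a score), and the toy candidates and scores are all hypothetical.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Triplet = Tuple[str, str, str]
# Hypothetical interface: given three English candidate words, return the
# senses chosen for them and the score of that choice.
Disambiguator = Callable[[str, str, str], Tuple[Triplet, float]]


def best_triplet_translation(
    t1: Sequence[str],
    t2: Sequence[str],
    t3: Sequence[str],
    disambiguate: Disambiguator,
) -> Tuple[Triplet, Triplet, float]:
    """Try every combination of candidate translations and keep the best one.

    Returns (chosen words, their senses, score). On ties, the combination
    encountered first is kept.
    """
    best = None
    for c1, c2, c3 in product(t1, t2, t3):  # k_1 * k_2 * k_3 combinations
        senses, score = disambiguate(c1, c2, c3)
        if best is None or score > best[2]:
            best = ((c1, c2, c3), senses, score)
    if best is None:
        raise ValueError("each word needs at least one candidate translation")
    return best


def toy_disambiguate(a: str, b: str, c: str) -> Tuple[Triplet, float]:
    """Stand-in scorer with made-up scores; a real one would use WordNet glosses."""
    toy_scores = {
        ("time", "pass", "come"): 0.9,
        ("weather", "pass", "come"): 0.4,
    }
    senses = (a + "#1", b + "#1", c + "#1")  # placeholder sense labels
    return senses, toy_scores.get((a, b, c), 0.0)


if __name__ == "__main__":
    words, senses, score = best_triplet_translation(
        ["weather", "time"], ["pass", "cross"], ["come", "arrive"], toy_disambiguate
    )
    print(words, senses, score)
```

The exhaustive search calls the disambiguation step k_1 * k_2 * k_3 times per triplet, which stays cheap as long as each Romanian word has only a few dictionary translations.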

Note that $(t_1^{m^*})^*$, for example, is a synset which corresponds to the best
translation of $w_1$ produced by the CHAD algorithm. However, since Romanian uses
many words joined by various orthographic signs, these compound words are not found in
the Romanian-English dictionary. Accordingly, not every Romanian word produces an
English correspondent as output of the above algorithm. However, many translations are
still correct. For example, the translation of the expression vreme trece (from the poem "Glossa"
by our national poet Mihai Eminescu) is Word: (Rom)vreme (Eng)Ageij^n^A , Word:
(Rom)trece (Eng)Flow^v^1 . As another example from the same poem, this time showing the
synset of each word (as output by our application), tine toate minte is translated
as Word: (Rom) tine (Eng) Keep^v^S :{keep, maintain}. Word: (Rom) toate (Eng)
All^adv^Z :{wholly, entirely, completely, totally, all, altogether, whole}. Word: (Rom)
minte (Eng) Judgment^n^2 : {judgment, judgement, assessment}.

5.2. Application to text entailment verification. The recognition of text entailment is
one of the most complex tasks in Natural Language Understanding [14]. Thus, a very
important problem in some computational linguistics applications (such as question answering,
summarization, segmentation of discourse, and others) is to establish whether a text follows from
another text. For example, a QA system has to identify texts that entail the expected
answer. Similarly, in IR the concept denoted by a query expression should be entailed by the
relevant retrieved documents. In summarization, a redundant sentence should be entailed
by other sentences in the summary. The application of WSD to text entailment
verification is treated by the authors in the paper "Text entailment verification with text similarity"
in this volume.

6. Conclusions and further work 

In this paper we presented a new algorithm for word sense disambiguation. The
algorithm is parametrized for: 1. all words (that is, nouns, verbs, adjectives, and adverbs); 2.
all nouns; 3. all verbs. Some experiments with this algorithm on ten files of the Brown
corpus are presented in section 4.2. Stemming was performed using the list from
http://snowball.tartarus.org/algorithms/porter/diffs.txt. Precision is calculated
relative to the corresponding annotated files in the SemCor corpus. Some implementation
details are given in section 4.1.
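
As an illustration of how precision against SemCor annotations can be computed, here is a minimal Python sketch. It assumes the usual definition (correctly tagged words divided by words for which the algorithm proposed a sense); the function name, the token identifiers, and the sense labels are hypothetical and are not taken from the experiments reported above.

```python
from typing import Mapping, Optional


def precision(
    predicted: Mapping[str, Optional[str]],
    gold: Mapping[str, str],
) -> float:
    """Fraction of attempted words whose predicted sense equals the gold sense.

    `predicted` maps a token identifier to the proposed sense, or to None when
    no sense was proposed; `gold` maps token identifiers to annotated senses.
    Tokens without a gold annotation are ignored.
    """
    attempted = [
        (token, sense)
        for token, sense in predicted.items()
        if sense is not None and token in gold
    ]
    if not attempted:
        return 0.0
    correct = sum(1 for token, sense in attempted if gold[token] == sense)
    return correct / len(attempted)


if __name__ == "__main__":
    # Made-up example: three words attempted, two of them correct.
    predicted = {"w1": "age#n#4", "w2": "flow#v#1", "w3": None, "w4": "keep#v#1"}
    gold = {"w1": "age#n#4", "w2": "flow#v#2", "w3": "all#adv#1", "w4": "keep#v#1"}
    print(round(precision(predicted, gold), 3))
```

Words for which no sense is proposed lower coverage rather than precision under this definition.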

We showed in section 5 how the disambiguation of a text helps in the automated translation
of a text from one language into another: each word in the first text is translated
into the most appropriate word in the second text. This appropriateness is considered
from two points of view: 1. that of the possible translations and 2. that of
the real (disambiguated) sense in the second text. Some experiments with
Romanian-English translation and text entailment verification are given (section 5).

Another problem which we intend to address in further work is the optimization
of a query in Information Retrieval. Finding whether a particular sense is connected with
an instance of a word is similar to the IR task of finding whether a document is relevant to a
query. It is established that a good WSD program can improve retrieval performance.
As IR is used by millions of users, an average improvement of even a few percent could
be seen as very significant.